Neural Networks - Part 2

2016-09-16, Josh Montague

The Plan

  • Quick review of Part 1
  • The library stack (Keras, Theano, Numpy, oh my!)
  • Examples!
    • Classification (Iris)
    • Classification (MNIST)
    • Regression (housing)

Review

Back in Part 1, we looked at some history, motivation, and a simple (if only mostly-working 😐) implementation of a neural network.

Recall the short version of how this worked:

  • there is an array of input "nodes" (one per feature)
  • the input nodes are "fully-connected" to an arbitrary number of nodes in the next, "hidden layer"
  • the value of each hidden node is computed by taking the inner product of the previous layer's values with the weight matrix, and then passing that linear combination through an "activation function," $f(\omega^T x)$. We introduced the sigmoid as one possible activation function
  • when there are many nodes in the hidden layer(s), the weights form a matrix; the weight connecting nodes $i$ and $j$ (in sequential layers) is matrix element $\omega_{ij}$
  • "forward propagation" through the network means repeating this for each layer until you reach the predicted output layer (a minimal numpy sketch of one pass follows this list)
  • "backpropagation" is the process of updating the weight matrix elements $\omega_{ij}$ by distributing the prediction error backward through the network according to a chosen loss function
  • forward and backward propagation are repeated a bunch of times until some convergence criterion is met
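
Here's that minimal sketch: a single forward pass through a 4-10-3 network (bias terms omitted; the weights are random, purely illustrative):

In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1234)

x = rng.rand(4)        # input layer: one value per feature
W1 = rng.rand(4, 10)   # weights: input -> hidden (10 nodes)
W2 = rng.rand(10, 3)   # weights: hidden -> output (3 nodes)

hidden = sigmoid(np.dot(x, W1))       # f(w^T x) at each hidden node
output = sigmoid(np.dot(hidden, W2))  # forward-propagate once more
print(output)          # the (untrained) network's prediction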

Remember that at least one reason this is an interesting set of techniques to explore is that it offers a very different way to think about features in a model. We don't have to specify all of the explicit model features in a data matrix (e.g. a column for $x$, a column for $x^2$, columns for $x y$ and $x y^2$, and so on). Instead, we define a structure that allows for stacking of arbitrary, non-linear combinations of the predefined set of data matrix features; this can lead to a more expressive set of features. On the other hand, it also means many more degrees of freedom, which increases computational complexity and decreases interpretability.
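
For contrast, here's the explicit-feature approach in sklearn terms - a toy sketch where every nonlinear term is a column you construct up front:

In [ ]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# one observation with two raw features, x and y
X_toy = np.array([[2.0, 3.0]])

# degree-2 expansion: [1, x, y, x^2, x*y, y^2]
print(PolynomialFeatures(degree=2).fit_transform(X_toy))  # [[1. 2. 3. 4. 6. 9.]]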

Moving beyond our for loop in Part 1, we can look at some more reasonable approaches to using neural networks in practice! In particular, we'll look at Keras, one of the most active and fastest-growing libraries for building, training, and using neural networks in Python.

I think it'll be helpful to understand the stack of libraries and their roles, so hang tight while we run through that, first...

Keras

Keras is a modular library with a scikit-learn-inspired API. It lets you write readable Python code to define the structure of a neural network model, as well as (optionally) the details of how the model should be trained and evaluated.

From the docs:

Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Use Keras if you need a deep learning library that:

  • allows for easy and fast prototyping (through total modularity, minimalism, and extensibility).
  • supports both convolutional networks and recurrent networks, as well as combinations of the two.
  • supports arbitrary connectivity schemes (including multi-input and multi-output training).
  • runs seamlessly on CPU and GPU.

There are many libraries for creating neural networks in Python. A quick Google search turns up:

  • Keras
  • TensorFlow
  • PyBrain
  • Blocks
  • Lasagne
  • Caffe
  • nolearn
  • PyML
  • ... and I'm sure there are more

I read the docs for a few, read some reddit and StackOverflow discussions, and asked some practitioners that I know for their opinions. My takeaway: if you're working in Python, familiar with scikit-learn, and want a good on-ramp to neural networks, Keras is a good choice.

For more discussion of the library and its motivation, check out the recent Quora Q&A in which the lead developer gave some great insight into the design and plans for the library.

Most of this session will involve writing code with Keras. But Keras doesn't actually do the computation; it uses another library for that (in fact, more than one). For the symbolic computation portion, Keras currently supports Theano (the default) and TensorFlow. We'll use Theano for this notebook.

Theano

From the docs:

Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data.

Essentially, by combining symbolic mathematical expressions with all sorts of compiler and computational optimizations (including automatic differentiation and dynamically generated C code!), Theano can make math happen very fast (either via the Python interpreter and numpy, or going right around them to CPU/GPU instructions). An interesting feature of Theano is that executing the same code on a GPU is achieved by simply setting a shell environment variable!
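
As a toy illustration of the symbolic style (a sketch; nothing below depends on it): define an expression, ask Theano for its gradient, and compile both into fast callables.

In [ ]:
import theano
import theano.tensor as T

x = T.dscalar('x')   # a symbolic scalar
y = x ** 2           # an expression built from it

dy_dx = T.grad(y, x)              # symbolic differentiation
f = theano.function([x], y)       # compile to an optimized callable
df = theano.function([x], dy_dx)

print(f(3.0), df(3.0))  # 9.0 6.0 (since d(x^2)/dx = 2x)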

One way to think about how these pieces relate to one another is (loosely):

scikit-learn:numpy :: keras:theano(+numpy)

Put another way, here's my attempt at a visual version:

Ok, enough talk, let's build something


In [ ]:
from IPython.display import Image
Image(data="img/mr-t.jpg")

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
seed = 1234; np.random.seed(seed)
import seaborn as sns

In [ ]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

%matplotlib inline

Foreword: "random"

Many of the numerical computations that we do in Python involve sampling from distributions. We often want these to be truly random, but computers can only provide so-called "pseudo-random" numbers (wiki, Python docs).

In many of our modeling use cases (particularly those which are ad hoc), having fluctuations in some pseudo-random values is fine. However, when there is variation in, say, the initial conditions of an algorithm, it can lead to variation in the outcome that is not representative of any true variance in the underlying data. Examples include choosing the starting centroids in a k-means clustering task, or initializing the weight matrices of a neural network.

When you want results to be reproducible (which is generally a good thing!), you have to "seed" the random number generator (RNG). In this way, when you send your code to someone else's computer, or if you run your code 10,000 times, you'll always have the same initial conditions (for parameters that are randomly generated), and you should always get the same results.

In a typical Python script that runs in one session, the line above (seed = 1234; np.random.seed(seed)) can be included once, at the top. In an IPython notebook, however, it seems that you need to set the seed in each cell where random parameter initialization may occur (i.e. in any cell that declares a NN model). I'm not 100% positive about this, but it's what I gathered from my experimentation, and it's the origin of the assorted calls to np.random.seed() you'll see below!
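
A quick illustration of the idea (a trivial sketch): two identically-seeded draws are bit-for-bit the same.

In [ ]:
np.random.seed(seed)
a = np.random.rand(3)

np.random.seed(seed)
b = np.random.rand(3)

# same seed => identical "random" draws
print(np.allclose(a, b))  # True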

1: Classification (Iris)

To get a sense for how Keras works, we'll start with a simple example: the golden oldie, the iris data set. That way we can focus our attention on the code, not the details of the data.

Furthermore, to illustrate the parallels with scikit-learn, let's run through that demo first. Since the Keras API is sklearn-like (and this team has lots of sklearn experience), hopefully that will provide some helpful conceptual hooks.

sklearn


In [ ]:
# import data (from seaborn, bc it gives you a df with labels)
iris = sns.load_dataset("iris")

iris.tail()

In [ ]:
# inspect
sns.pairplot(iris, hue='species')

In [ ]:
# get train/test split (no preprocessing) 
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
y = iris['species'].values

# take a 75/25 split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=seed)

# verify array sizes
#[x.shape for x in [X_train, X_test, y_train, y_test]]

In [ ]:
# fit default LR model
model = LogisticRegression()
model.fit(X_train, y_train)

In [ ]:
# score on test (should be ~80-90%)
print("Accuracy = {:.2f}".format(model.score(X_test, y_test)))

Not bad for less than ten lines of code!

In practice, we should be a bit more careful and consider some additional work:

  • preprocess the data (scaling, normalizing)
  • use cross validation techniques to build an uncertainty or confidence (e.g. k-fold cv)
  • gridsearch the model parameters
  • ... etc. ...

But for now, we're just trying to show the comparison between the libraries, and this will do.

Now, let's write the same kind of classification system in Keras!

keras

Warning! If you have a dataset the size of the iris data (tiny!), you probably shouldn't use a neural network in practice; instead, consider a model that is more interpretable. We're using #tinydata here because it's simple and common.

(One-hot) encode the labels

We can start with the same train/test-split data arrays, but we have to make a modification to the output data (labels). scikit-learn estimators transparently convert categorical labels (e.g. strings like "virginica" and "setosa") into numerical values or arrays; we have to do that step manually for Keras models.

We want the model output to be a 3x1 array, where the value at each index represents the probability of the corresponding category (0, 1, or 2). This format of training data - where the true class is 1 and all other values are 0 - is also known as a one-hot encoding.

There are a few ways to do this:

  • pandas.get_dummies() (we'll use this one)
  • scikit-learn's LabelBinarizer() (nb: LabelEncoder() alone gives integer labels, not one-hot arrays)
  • Keras' np_utils.to_categorical()
  • ... or roll your own

Here's an example of how pd.get_dummies() works:


In [ ]:
# create a sample array with a few of each species from the original df
species_sample = iris.groupby(by='species').head(3)['species']

species_sample

In [ ]:
# get a one-hot-encoded frame from the pandas method
pd.get_dummies(species_sample, prefix='ohe')

Now, instead of a single string label as our output (prediction), we have a 3x1 array, where each array item represents one of the possible species, and the non-zero binary value gives us the information we need.

scikit-learn was effectively doing this same procedure for us before, but hiding all of the steps that map the labels to the prediction arrays.
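
And if you ever need to go the other way - from one-hot rows back to integer class indices - an argmax along the rows does the trick (a quick sketch using the sample above):

In [ ]:
# recover the integer class index from each one-hot row
pd.get_dummies(species_sample).values.argmax(axis=1)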

Back to our original data: we can one-hot encode the y arrays that we got from our train-test split earlier, and can re-use the same X arrays.


In [ ]:
# encode the full y arrays
ohe_y_train = pd.get_dummies(y_train).values
ohe_y_test = pd.get_dummies(y_test).values

Define, compile the model

Time to make our neural network model!

Keras has an object-oriented syntax that starts with a model, then adds layers and activations.

The Sequential model is the main one we care about - it assumes the network is defined as a linear stack of layers (and activations). We add those layers and activations one at a time, and then compile the model before we can train it.

There is art and science to choosing how many hidden layers and nodes within those layers. We're not going to dive into that in this session (mostly because I don't yet know the answers!), so maintain your skepticism, but just sit with it for now.


In [ ]:
# create a new model
model = Sequential()

# add layers
# - the first hidden layer must specify the dimensions of the input layer (4x1, here)
# - this adds a 10-node, fully-connected layer following the input layer
model.add(Dense(10, input_dim=4))

# add an activation to the hidden layer
model.add(Activation('sigmoid'))

For now, we'll stick to a 3-layer network: input, hidden, and output.

The final, output layer needs to have three nodes since we have labels that are 3x1 arrays. So, our layers and sizes are: input (4 nodes), hidden (10 nodes), and output (3 nodes).

At this point, I only have a small amount of guidance for choosing activation functions. See the notes at the end of the notebook for a longer discussion. Importantly, when we want our output values to be between 0 and 1 and to represent class probabilities (summing to 1), we choose the softmax activation function.
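
As a quick sanity check on that claim, here's a minimal numpy sketch of softmax (the score values are made up):

In [ ]:
def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # ~[0.66, 0.24, 0.10]
print(softmax(scores).sum())  # 1.0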


In [ ]:
# add the output layer, and a softmax activation
model.add(Dense(3))
model.add(Activation('softmax'))

Finally, we compile the model. This is where we specify the optimizer and loss function.

Since this is a multi-class classification problem, we'll use the categorical_crossentropy loss function. That's the advice I was able to find most often, but I need to learn more about the decision criteria for both optimizers and loss functions; they can have a big effect on your model accuracy.


In [ ]:
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=["accuracy"])

Then we fit() the compiled model using the original training data, including the one-hot-encoded labels.

The batch_size is how many observations are propagated forward before updating the weights (backpropagation). Typically, this number will be much bigger (the default value is 32), but we have a very tiny data set, so we artificially force this network to update weights with each observation (see the Warning above).


In [ ]:
# keras uses the same .fit() convention
model.fit(X_train, ohe_y_train, batch_size=1, nb_epoch=20, verbose=1)

We can evaluate() our accuracy by using that method on the test data; this is equivalent to sklearn's score().


In [ ]:
loss, metrics = model.evaluate(X_test, ohe_y_test, verbose=0)

# score on test (should also be ~80-90%)
print("Accuracy = {:.2f}".format(metrics))

Not bad!

There are also sklearn-like methods that return class assignment and their probabilities.


In [ ]:
classes = model.predict_classes(X_test, verbose=0)
probs = model.predict_proba(X_test, verbose=0)

print('(class) [ probabilities ]')
print('-'*40)

for x in zip(classes, probs):
    print('({}) {}'.format(x[0],x[1]))

Now, more compact...

We walked through that in pieces, but here we can collect all of those steps together to see just how few lines of code it required (though remember that we did have the one additional step of creating one-hot-encoded labels).


In [ ]:
np.random.seed(seed)

# instantiate the model
model = Sequential()

# hidden layer
model.add(Dense(10, input_shape=(4,)))
model.add(Activation('sigmoid'))

# output layer
model.add(Dense(3))
model.add(Activation('softmax'))

# set optimizer, loss fnc, and fit parameters 
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=["accuracy"])
model.fit(X_train, ohe_y_train, batch_size=1, nb_epoch=20, verbose=0)

# score on test set
loss, metrics = model.evaluate(X_test, ohe_y_test, verbose=0)
print("Accuracy = {:.2f}".format(metrics))

Or - even more succinctly - we can build the same model but collapse the structure definition because of Keras' flexible API...


In [ ]:
np.random.seed(seed)

# move the activations into the *layer* definition
model = Sequential([
            Dense(10, input_dim=4, activation='sigmoid'),
            Dense(3, activation='softmax'),
            ])

model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=["accuracy"])
model.fit(X_train, ohe_y_train, batch_size=1, nb_epoch=20, verbose=0)

loss, metrics = model.evaluate(X_test, ohe_y_test, verbose=0)
print("Accuracy = {:.2f}".format(metrics))

Cool! It seems to work pretty well.

Peeking inside the model

At this point, what is the model we created? In addition to its network structure (layers with sizes and activation functions), we also have the weight matrices.


In [ ]:
for layer in model.layers:
    print('name: {}'.format(layer.name))
    print('dims (in, out): ({}, {})'.format(layer.input_shape, layer.output_shape))
    print('activation: {}'.format(layer.activation))
    # nb: I believe the second weight array is the bias term
    print('weight matrix: {}'.format(layer.get_weights()))
    print()

Saving the model

If you're looking to save off a trained network model, these are most of the pieces you'd need to write to disk. Keras uses HDF5 (sort of "named, organized arrays") to serialize trained models with a model.save() method (and a corresponding load_model() function).

If you're looking to save the definition of a model, but without all of the weights, you can write it out in simple JSON or YAML representation e.g. model.to_json().
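
A sketch of both paths (the filename is made up, and model.save()/load_model() assume a reasonably recent Keras):

In [ ]:
from keras.models import load_model

# serialize architecture + weights + optimizer state to HDF5
model.save('iris_nn.h5')

# ...and later, restore the whole thing
restored = load_model('iris_nn.h5')

# or save just the structure (no weights) as JSON
json_def = model.to_json()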

2: Classification (MNIST)

Let's do one more familiar classification problem - last year's 4C dataset: the MNIST image labeling task.

This time we will:

  • have more data (good!)
  • do a tiny bit of data normalization (smart!)
  • build a bigger network (more expressive!)

In [ ]:
from keras.datasets import mnist

In [ ]:
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)
print("y_test original shape", y_test.shape)

Remember that the MNIST data is an array of 28-pixel by 28-pixel "images" (brightness values), 60k in the training set, 10k in the test set.


In [ ]:
plt.figure(figsize=(8,4))

for i in range(3):
    plt.subplot(1,3,i+1)
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Label: {}".format(y_train[i]))

Preprocessing and normalization

If you recall from the last 4C event, the first preprocessing step most of us used with this data was to unroll the 2D arrays into a single vector.

Then, as with many other optimization problems, we'll see better results (and faster convergence) if we standardize the data into a smaller range. This can be done in a number of ways, e.g. with sklearn's StandardScaler (zero mean, unit variance) or Normalizer (scale to unit norm). For now, we'll just rescale to the range 0-1.
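
(If you'd rather use the sklearn scalers, a sketch of the equivalent step might look like this; we'll stick with the simple rescale below.)

In [ ]:
from sklearn.preprocessing import StandardScaler

# hypothetical alternative: zero-mean, unit-variance scaling
# (fit on train only, then apply the same transform to test)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train.reshape(-1, 784).astype('float32'))
X_test_std = scaler.transform(X_test.reshape(-1, 784).astype('float32'))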

Then, we also need to one-hot encode our labels again.


In [ ]:
# unroll 2D pixel data into 1D vector
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

# convert from the original range (0-255) to 0-1
# (cast to float first so we don't risk integer division)
X_train = X_train.astype('float32') / X_train.max()
X_test = X_test.astype('float32') / X_test.max()

# OHE the y arrays
ohe_y_train = pd.get_dummies(y_train).values
ohe_y_test = pd.get_dummies(y_test).values

Now we'll build another Sequential model.

This time we'll use the relu ("rectified linear unit") activation function, which is more commonly used in practice.


In [ ]:
np.random.seed(seed)

model = Sequential([
            Dense(512, input_dim=784, activation='relu'),
            Dense(512, activation='relu'),
            Dense(10, activation='softmax')
        ])

The shape of this network is now: 784 (input nodes) => 512 (hidden nodes) => 512 (hidden nodes) => 10 (output nodes). That's $784 \times 512 + 512 \times 512 + 512 \times 10 \approx 6.7 \times 10^5$ weights (plus a bias for each non-input node)!
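
We can ask Keras to check that arithmetic for us (assuming your version provides these introspection helpers; count_params() includes the bias terms):

In [ ]:
# layer-by-layer parameter counts, plus the total
model.summary()

# (784*512 + 512) + (512*512 + 512) + (512*10 + 10) = 669,706
print(model.count_params())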


In [ ]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, ohe_y_train, batch_size=128, nb_epoch=5, verbose=1)

In [ ]:
loss, metrics = model.evaluate(X_test, ohe_y_test, verbose=1)

print()
#print('Test loss:', loss)
print('Test accuracy:', metrics)

If you recall the 2015 4C leaderboard, a score of 98% would have put you in the top 10% of submissions!

Speaking only for myself, the entries that I submitted in that range took much more time and effort than those last few notebook cells!

3: Regression (Housing)

Finally, let's do an example of modeling a continuous variable - a regression task. We'll use another of the canned datasets: the Boston housing price data.

This data comprises a few hundred observations of neighborhoods, each with thirteen related features. The target is the median price of the homes in that area (in thousands of dollars). This means that the output variable is a continuous, real (and positive) number.

You can uncomment the print(...'DESCR') cell for a longer description.


In [ ]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [ ]:
# load + inspect data
boston = load_boston()

X = boston.data
y = boston.target
labels = boston.feature_names

b_df = pd.DataFrame(X, columns=labels)

b_df.head()

In [ ]:
# built-in information about the dataset and features
#print(boston.get("DESCR"))

Since the feature values span many orders of magnitude, we should standardize them for optimization efficiency. Then we can make our train/test split.

It's worth noting that we could also experiment with standardizing the output variable. For now, we won't.
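
(If we did want to try it, a sketch might look like the following; the key is to invert the transform before reporting predictions in dollars.)

In [ ]:
# hypothetical: standardize the target as well
y_scaler = StandardScaler()
y_std = y_scaler.fit_transform(y.reshape(-1, 1))

# ...then, after training on y_std, map predictions back:
# y_pred = y_scaler.inverse_transform(model.predict(X_test))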


In [ ]:
# standardize the feature data (all features now 0-1)
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

In [ ]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=seed)

In [ ]:
# build model
np.random.seed(seed)

model = Sequential([
            # use a single hidden layer, also with 13 nodes
            Dense(13, input_dim=13, activation='relu'),
            Dense(1)
        ])

In [ ]:
# compile + fit model
# nb: we skip the 'accuracy' metric here - it isn't meaningful for regression
model.compile(loss='mean_squared_error', optimizer='rmsprop')

model.fit(X_train, y_train, batch_size=5, nb_epoch=100, verbose=0)

In [ ]:
# evaluate on test data; with no extra metrics, evaluate() returns just the loss (MSE)
mse = model.evaluate(X_test, y_test, verbose=1)

In [ ]:
print('MSE:', mse)

y_pred = model.predict(X_test)
print('R^2 score:', r2_score(y_test, y_pred))

In [ ]:
plt.figure(figsize=(8,8))

# compare the predictions to test
plt.plot(y_test, y_pred, 'o', alpha=0.75, label='model predictions')

# draw a diagonal
xy = np.linspace(min(y_test), max(y_test))
plt.plot(xy, xy, '--', label='truth = pred')

plt.title('3-layer NN')
plt.xlabel('truth ($k)')
plt.ylabel('prediction ($k)')
plt.legend(loc='best')

Cool!

It looks like our model struggles a bit with high-valued observations. Something worth digging into if we were to work on optimizing this model for this task.

BUT

Just to remind you that this is a toy problem that probably shouldn't be solved with a neural network, let's look at the corresponding linear regression model.

We use the same data....


In [ ]:
model = LinearRegression()

model.fit(X_train, y_train)

In [ ]:
y_pred = model.predict(X_test)

print('R^2:', r2_score(y_test, y_pred))

And get similar $R^2$ values with a much more interpretable model. We can compare the prediction errors to the same chart from before...


In [ ]:
plt.figure(figsize=(8,8))

# compare the predictions to test
plt.plot(y_test, y_pred, 'o', alpha=0.75, label='model predictions')

# draw the diagonal
xy = np.linspace(min(y_test), max(y_test))
plt.plot(xy, xy, '--', label='truth = pred')

plt.title('Linear Regression')
plt.xlabel('truth ($k)')
plt.ylabel('prediction ($k)')
plt.legend(loc='best')

And - one reason why a linear model should often be preferred - we can just look straight at the feature coefficients and read off how they relate to the predictions.


In [ ]:
plt.figure(figsize=(8,8))

# where to position the bars/ticks
locs = range(len(model.coef_))

plt.barh(locs, model.coef_, align='center')
plt.yticks(locs, b_df.columns);

plt.title('linear regression coefficients')
plt.xlabel('coefficient value')
plt.ylabel('feature')

Wrap Up

Hopefully, between Part 1 and now Part 2, you've gained a deeper understanding of how neural networks work, and how to use Keras to build and train them.

At this point, the only thing we haven't really illustrated is how to use them at the #bigdata scale (or with unconventional data types) where they have proven particularly valuable. Perhaps there will be a Part 3...

What next?

If you want to follow up (or go deeper) on the concepts that we covered, here are some links

Acknowledgements

In addition to the links already given, most of this notebook was cobbled together based on other examples I found online, including:

